# DistSim: A performance model of large-scale hybrid distributed DNN training

Guandong Lu

Shanghai Jiao Tong University Shanghai Qi Zhi Institusion Shanghai, China lugd0525@sjtu.edu.cn

Yangjie Zhou

Shanghai Jiao Tong University Shanghai Qi Zhi Institusion Shanghai, China yj zhou@sjtu.edu.cn

Yanming Miao Huawei Technologies Co., Ltd

Shenzhen, China miaoyanming@huawei.com

Runzhe Chen

Shanghai Jiao Tong University Shanghai Qi Zhi Institusion Shanghai, China runzhe chen@sjtu.edu.cn

Rui Zhang

Huawei Technologies Co., Ltd Shenzhen, China zhangrui262@huawei.com

Zhifang Cai

Huawei Technologies Co., Ltd Shenzhen, China caizhifang@huawei.com Yakai Wang

Shanghai Jiao Tong University Shanghai Qi Zhi Institusion Shanghai, China wamgyakai@sjtu.edu.cn

Zheng Hu

Huawei Technologies Co., Ltd Shenzhen, China hu.zheng@huawei.com

Li Li

Shanghai Jiao Tong University Shanghai, China lilijp@sjtu.edu.cn

Jingwen Leng\*

Shanghai Jiao Tong University Shanghai Qi Zhi Institution Shanghai, China leng-jw@sjtu.edu.cn

#### **ABSTRACT**

With the ever-increasing computational demand of DNN training workloads, distributed training has been widely adopted. A combination of data, model and pipeline parallelism strategy, called hybrid parallelism distributed training, is imported to tackle the problem of deploying large-scale models. However, how to evaluate the hybrid strategy and the utilization of each device remains a challenge since existing works either profile on a real large-scale cluster with high time and money costs or only analyze a specific type of parallelism without considering the hybrid parallelism. In this work, we proposed DistSim, an event-based performance model to accurately analyze each device's computation and communication activities with low profiling costs. DistDim breaks down the model into events according to the given distributed strategy, which can be profiled on two nodes. Then DistSim leverages the hierarchy of different parallel strategies to generate the computation and communication event-flow from layer level to model level and finally the activity timeline of each device participating

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

CF '23, May 9-11, 2023, Bologna, Italy

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-0140-5/23/05...\$15.00 https://doi.org/10.1145/3587135.3592200

Minyi Guo\* Shanghai Jiao Tong University Shanghai Qi Zhi Institusion Shanghai, China guo-my@cs.sjtu.edu.cn

in training. Experiment shows that DistSim can reach <4% errors when predicting distributing training batch time and <5% errors when predicting a single device's activity time in various hybrid strategy settings. We also provide a use-case of DistSim, automatically evaluate and search the best distributed training strategy, and find a hybrid strategy with at most 7.37× throughput improvement.

# **CCS CONCEPTS**

Computing methodologies → Modeling and simulation;
 Machine learning; Parallel algorithms.

### **KEYWORDS**

Distributed DNN training, performance model

#### **ACM Reference Format:**

Guandong Lu, Runzhe Chen, Yakai Wang, Yangjie Zhou, Rui Zhang, Zheng Hu, Yanming Miao, Zhifang Cai, Li Li, Jingwen Leng, and Minyi Guo. 2023. DistSim: A performance model of large-scale hybrid distributed DNN training. In 20th ACM International Conference on Computing Frontiers (CF '23), May 9–11, 2023, Bologna, Italy. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3587135.3592200

#### 1 INTRODUCTION

Deep Neural Networks (DNNs) have been widely used in many areas and tasks due to their superior accuracy. For better accuracy, researchers have been proposing larger and deeper networks and the number of model parameters is increasing exponentially from 0.34 billion in Bert-Large [6] to 175 billion GPT-3 [4]. It is impractical to deploy such large models on a single device. At the same time,

 $<sup>^{\</sup>star}\mathrm{Jingwen}$  Leng and Minyi Guo are the corresponding authors.

since it is more difficult in convergence for a large-scale model, the time to train a model is increasing rapidly, from 4 days in Bert [6] to 355 GPU years computed in GPT-3 [4].

To address these problems, several solutions including designing domain specific accelerators [13, 37] and efficient computing operators [11, 42, 43], model pruning and sparsity [9], quantization [10, 12], and distributed training come into the picture. In distributed training, several distributed training strategies have been proposed for efficient training: data parallelism [20], which increases the batch size to accelerate convergence; model parallelism [33] and pipeline parallelism [8, 15], which partitions model into several pieces that can be deployed on a single device. Depending on the combination of these strategies, called hybrid parallelism distributed training [30], large models are able to be trained in large-scale clusters with a reasonable speedup.

However, it still costs days to weeks to complete a training task [24] even with the help of hybrid distributed training. Not only hardware sharing [38] and resource management [21, 41] should be considered, it is also important to find the best training strategy in a large strategy search space. For a given distributed training strategy, it takes lots of time to initialize, partition model and profile several iterations to get the averaged throughput. When searching for a broad strategy space in large-scale clusters, it will cost a large amount of time and money before actual training to find such a solution. It is difficult to design an accurate throughput or utilization estimation without profiling. Some prior works use analytical models to define the computation time [31, 34], i.e., operation count divided by computation capacity, but they have a large accuracy gap compared to the actual training process. Meanwhile, research institutions or small companies usually do not own a large-scale cluster, which is required for evaluating their strategyfinding algorithms. As a result, they often need to rent clusters with additional expenses. For example, to rent a cluster with 2048 GPU devices, one needs to spend 7168 (3.5  $\times$  2048) USD per hour in an AWS cluster [1], before the actual training.

Therefore, it is important to have an accurate and cost-effective analysis tool to evaluate various distributed training strategies. Prior works like Daydream [45] and dPRO [14] try to address this challenge, but they only focus on computation and communication dependency in data parallelism. The complexity of hybrid parallelism distributed training has not been addressed yet.

In this work, we propose DistSim, an accurate simulator to analyze the performance of arbitrary distributed DNN training strategies in large-scale clusters. We dive into the relationships between each strategy in hybrid parallelism and develop our simulator with the following two insights. First, there exists extensive profiling redundancy in hybrid parallelism. For example, each replica in data parallelism performs the same forward and backward computation; each device in model parallelism computes the same operator with the same input and weight shape in one layer. There is no need to profile all replicas as the actual training does. Second, different parallelism strategy has different dependency level in hybrid distributed training, which we define as the hierarchical dependency of hybrid parallelism. Each strategy has its own data dependencies and is unrelated to others. For example, changing the partition size of model parallelism does not affect on the dependencies in pipeline

parallelism since pipeline parallelism is unrelated to the computation inside one layer – where model parallelism participates in.

Based on these insights, we define "event" to eliminate the redundancy of profiling computation and communication in hybrid parallelism. Without loss of generality, in clusters with homogeneous devices and no network hierarchy, the same computation and communication performed by different devices can be gathered into one event and need to be profiled only once. Hence the profiling of the whole training process in large-scale clusters can be reduced to a minimal number of 2 nodes (1 node is unable to profile the inter-node communication). All the events are used to construct a complete training timeline from the lowest level – model parallelism, to the highest level – data parallelism.

DistSim models the training process with the event-based construction. We evaluate the simulation accuracy of DistSim and show that DistSim is able to estimate the performance of different hybrid training strategies accurately. In addition, we provide a simple use case of DistSim: finding a better hybrid training strategy.

In summary, we make the following contributions:

- We identify the profiling redundancy in hybrid distributed training and define a concept of the event, which avoids redundant computation and communication, and reduces the profiling overhead from real large-scale clusters to only two nodes.
- We build DistDim, a performance model leveraging the hierarchy of different distributed training strategies to construct the timeline of each device using profiled events. We propose detailed modeling methods for data, pipeline, and model parallelism. DistSim has the ability to deal with any combination of these strategies in hybrid parallelism.
- We evaluate DistSim from different aspects and show that DistSim can evaluate different hybrid training strategies accurately both in batch-time − critical path (<4% error) and in single device's activity (<5% error). We also provide a use case of DistSim, searching for the best parallel strategy and finding a hybrid strategy with at most 7.37× throughput improvement in an unseen 48-layer model "Bert-exLarge".</p>

### 2 BACKGROUND AND MOTIVATION

This work presents an accurate and cost-effective performance model to analyze training time and device activities in distributed DNN training. It allows users to keep track of the detailed information of every device given a model to train and hybrid training strategies to apply. In this section, we first present a brief background of different distributed training strategies. We then present the motivation that such a new performance model is necessary. Specifically, we explain why existing performance models, including direct profiling or analytical model, are ineffective and insufficient.

#### 2.1 Hybrid Distributed DNN Training

To train a large-scale DNN models, prior works have exploited various parallelism as follows.

2.1.1 Data Parallelism. In data parallelism, every device holds the same model. Input data is partitioned among batches and distributed into devices. Each device independently computes the forward and backward steps, and the gradient is gathered to sync the model's



Figure 1: Examples of data and model parallelism.

state. Fig. 1(a) shows a brief timeline of data parallelism. Using data parallelism for DNN training increases the batch-size of the training process, which also speeds up the training with a large learning rate. As such, data parallelism is widely used.

There are two variants of data parallelism: Parameter-Server (PS) [5] and Ring-AllReduce [2]. In PS method, workers send their gradients to parameter servers and receive the latest state of the model. In Ring-AllReduce, all devices send a part of the gradients to aggregate, then broadcast its aggregation result. In this work, we mainly discuss and model ring-AllReduce-based data parallelism, as they are widely used by distributed training frameworks such as PyTorch Distributed [20] and Horovod [32].

- 2.1.2 Model Parallelism. Model parallelism is also named as tensor model parallelism by Megatron-LM [24] or horizontal parallelism by DistIR [31]. First used by training AlexNet [17], model parallelism splits the weight matrix into multiple devices. Each device computes with its partitioned weight matrix for the same input and gathers the partial output with other devices, as shown in Fig. 1(b). Model parallelism is useful when device memory is insufficient to deploy a large-scale model. Recently, Megatron-LM [24] has implemented model parallelism into distributed training frameworks for partitioning large language processing models.
- 2.1.3 Pipeline Parallelism. Although pipeline parallelism is sometimes also called inter-layer model parallelism or pipeline model parallelism [33], pipeline is the main concept to distinguish from the previously introduced model parallelism. Pipeline parallelism partitions models into layer-wise stages and deploys them into different devices. Devices only need to compute the stages in charge and send their output activation to the next stage.

The naive pipeline parallelism, only computing a layer once per iteration, reaches a low device utilization since worker has to wait until its input is prepared. This phenomenon is called pipeline bubbles. Prior work introduces micro-batch to overlap training and proposes various algorithms to reduce bubbles, which can be categorized into synchronous and asynchronous pipeline parallelism.

Synchronous pipeline parallelism has a synchronized weight updating phase for all devices to guarantee convergence. Unlike the naive pipeline, the training batch is divided into micro-batches [15], which can be computed in a pipelined approach. Prior works, such as GPipe [15], Dapple [8] and Chimera [19] introduce various scheduling algorithms to increase utilization and reduce memory consumption. Fig. 2 shows a brief example of GPipe (a) and Dapple



Figure 2: An example of two algorithms in pipeline parallelism to reduce pipeline bubbles: GPipe (a) and Dapple (b). The number in each stage represents micro-batch index.

(b), which explains how these algorithms overlap computing for micro-batches to reduce bubbles.

Asynchronous pipeline parallelism removes the universal weight-updating phase along devices. Each device updates its own weight and computes different micro-batches simultaneously, which reaches a high device utilization. However, although previous experiments [23] show harmlessness, the lack of universal synchronization to update weight may affect training convergence. Thus we focus only on discussing synchronous pipeline parallelism in this work.

2.1.4 Hybrid Parallelism. Prior work OWT [16] is the first to use hybrid parallelism, which is not strictly a new category of distributed training strategies, but partitions the model into more than one aspect and leverages the advantages of different strategies of data, model, and pipeline parallelism. The representative work of hybrid parallelism includes PTD-P from Megatron-LM [33] and 3D-parallelism in DeepSpeed [30].

# 2.2 Why Not Profiling Performance Directly?

To model performance, the most straightforward solution is to actually run and profile the training process on a real cluster to get the iteration time or devices' activities. However, this solution has a shortcoming: users must have access to the *real* training cluster, which could be time- and money- consuming. For example, when researchers want to profile and analyze the utilization of devices in a particular hybrid training strategy with 2048 GPUs, which is a typical setting in Megatron-LM [24], they have to find such a 2048-device cluster to profile. Such cluster is uncommon to be seen and is very expensive to rent  $(3.5 \times 2048 = 7168 \text{ USD})$  per hour in an AWS cluster [1]). Unlike profiling on a single device, the high overhead in clusters makes it hard to directly profile, which is the first challenge of performance modeling.



Figure 3: The iteration time comparison between analytical model and actual profiling result on the training of Bert-Large[6] with 4 to 16 A40 GPUs. The x-axis shows the notations of different distributed training strategies, which is defined in Sec.5.1.

#### 2.3 Why Not Analytical Model?

Another way for performance modeling is to evaluate analytically with the information of training hardware and DNN training workload. For example, one common heuristic approach as prior work does [31][34] is: using the division of floating-point operators count and hardware computing capacity (FLOPS) to represent computation time and regarding the division of data transmission size and the bandwidth as the communication time. One problem of the heuristic approach is accuracy, as it is not often true that the DNN operators will full-utilize the hardware's computing resource [44].

To show this problem, we evaluate the heuristic approach with Bert-Large [6] model and 4-16 A40 GPUs with different distributed training strategies. The comparison of iteration time (from which the throughput of the training process can be conducted) presented in Fig. 3 shows a significant bias from heuristic to real profiling, with at most 40.4% error and 26.1% on average. The high divergence of the heuristic evaluation makes results unreliable for further analysis, such as analyzing per-GPU utilization.

# 2.4 Why Not Using Current Distributed Training Simulator?

A third possible approach is using current distributed training simulators such as Daydream [45] and Dpro [14], as they can profile kernels and construct the training process without running them. Both simulators are constructed based on a critical assumption: tasks in distributed DNN training workloads are highly sequential [45], which means when one device finishes computing the first layer, it will naturally launch the second layer. This assumption enables them to ignore the dependencies among layers. Although the assumption is true in data parallelism, it is not always correct for other parallelism strategies. For example, after one device finishes

Table 1: A summary and comparison of different ways to model performance of large-scale DNN training, to show our motivation to a general and light-weighted modeling approach

| Methodology      | Representative work      | Support<br>data<br>parallelism | Support<br>model/pipeline<br>parallelism | Low<br>profiling<br>overhead | Accu-<br>racy |
|------------------|--------------------------|--------------------------------|------------------------------------------|------------------------------|---------------|
| Direct profiling | Megatron-LM [24]         | 0                              | 0                                        | ×                            | 0             |
| Analytical Model | DistIR [31], accPar [34] | 0                              | 0                                        | 0                            | ×             |
| Simulation-based | daydream [45], dPro [14] | 0                              | ×                                        | 0                            | 0             |
|                  | DistSim (Ours)           | 0                              | 0                                        | 0                            | 0             |



Figure 4: An illustration for profiling redundancy in hybrid training, including: same computation among micro-batches (1) and different devices (2), as well as transferring activations whose shape is the same (3) and weights with same size (4).

computing the first layer, it may switch to the next micro-batch of pipeline parallelism and compute the first layer again, which breaks the computing sequence in DNN training. Therefore, these simulators can replay the training process of data parallelism, but cannot support other strategies in hybrid parallelism. We regard this problem as the complexity of dependencies in hybrid parallelism, which is shown as the second challenge of modeling.

Summary. Prior approaches to model performance of large-scale hybrid distributed DNN training have their shortcomings. Particularly, because of the unsatisfying accuracy, it is not suitable to use an analytical approach to model performance, and profiling is the best choice. Directly profiling and using current profiling-based simulators will face the challenges of profiling consumption and the complexity of dependencies. Therefore there is a need for a light-weight and generalized approach: profile on small-scale clusters and extrapolate towards the full-scale training process, which is what we proposed in this work, presented in Table 1.

### 3 DISTSIM

Our method to model large-scale distributed training performance is based on two observations. In this section, we first introduce these observations, and then show how DistSim solves the challenges of profiling and dependency raised in the previous section.

#### 3.1 Observations

**Observation 1: Profiling redundancy.** While training with data parallelism, certain devices perform the same computation since they all correspond to the same model. One device may compute the same layer for different micro-batches in pipeline parallelism. The computation and communication redundancy are shown in Fig. 4. In addition, some devices may be idle, waiting for input from other devices (recall pipeline bubbles in Sec. 2.1.1). Those repeated computing time or idle time will be profiled multiple times in direct training, which is unnecessary.

Observation 2: Hierarchical dependency of different parallel strategies. Different parallel strategies need to synchronize or communicate for different data, which is shown in Fig. 5: data parallelism synchronizes weights, model parallelism gathers partial output, and pipeline parallelism communicates activation for



Figure 5: An illustration for hierarchical dependencies in hybrid distributed training. Data parallelism focuses on the weight dependencies across models among data parallelism. The dependencies within a model is pipeline parallelism in charge, and finally, they are controlled by model parallelism in one layer.

different stages. Based on this fact, every strategy only focuses on its own dependency and ignores others'. This observation reduces the problem of handling dependence complexity to finding a way to combine each strategy's dependency together.

#### Overview of DistSim

Fig. 6 is an illustration of how DistSim works. Given a model and a particular configuration of parallel strategy, DistSim reduces the redundancy of profiling (Observation 1) by defining an abstraction named event to discover and preserve identical operators. As these events are essentially operators deployed on individual devices, they can be profiled only once and without large-scale clusters. These events, which can be divided into computation and communication events based on the type of operators, are profiled by DistSim separately according to events' type.

After profiling and getting the detailed time information of events, DistSim recovers the timeline of the original training process by further defining and combining composed-event, which contains several events. Composed-event aims to make each distributed strategy only focus on its own dependency (Observation 2) when modeling. Each strategy's modeling constructs an event list differently (saved in composed-event) and DistSim combines these event lists together. The final timeline can be generated when combining these event lists with profiled event times.

The output of DistSim is a detailed execution timeline for the fullscale distribution training, which contains when and which device will compute and communicate for certain operators. The timeline can be used by users for further analysis, such as iteration-time, device utilization, and pipeline bubble analysis.

In DistSim, we profile to get the correct events' elapsed time. When users do not have a profiling device or want a light-weight and profiling-free way by sacrificing some accuracy, they can alternately use GPU simulators such as MGPUSim [35] and operator predictors such as Habitat [39]. In addition, the events' time can be stored and reused when modeling a new parallelism strategy as long as the model can generate the same event.



Figure 6: An overview of DistSim, which takes the model and

distributed training configurations as input. DistSim first generate an abstraction called events from sub-models, then profiles and uses events to construct and recover the whole training process.

#### 4 FRAMEWORK DESIGN DETAIL

In this section, we describe the details in DistSim that complete the process in Fig. 6, including (i) generating events; (ii) profiling events; (iii) hierarchical modeling based on these events.

#### 4.1 Event Generator

We leverage the model partition function in current distributed training frameworks such as PyTorch Distributed [20]. During actual training, with the model and parallel strategy configurations inputted, frameworks will generate per-device sub-models to deploy on each distributed rank.

DistSim takes over these sub-models before frameworks actually deploy them, and parses all computation and communication operators. Then DistSim gathers all identical operators to a data structure called *event*, including computation events and communication events according to the name of these operators. Events use the operator name, parameters and input shape to distinguish from others. In addition, a supplementary attribute is defined for communication events to show whether the communication is intra-node or inter-node. This attribute is defined by recognizing whether the source and destination rank is on the same node. Finally, the input model is transferred into a set of events at this phase.

# 4.2 Profiling Events

Computation events, which run on one device, can be easily profiled by current profilers such as CUPTI [25]. However, communication events are more complicated with multiple devices involved.

In our DistSim, we further divide communication events into point-to-point communication (usually used in activation transmission) and all-reduce communication (usually used in activation gathering and weight synchronization). In the following paragraphs, we describe how to profile these two communication events.

Point-to-point Communication Event Profiling. The main challenge of profiling point-to-point communication event is that



Figure 7: Example of the queuing time in the point-to-point communication event. It is not correct to only profile "SEND" time

we cannot simply adopt the computation profiling methods into communication because of the queuing time. The transmission cannot be established unless both sender and receiver have called their functions, as shown in Fig. 7.

Based on the observation in Dpro [14] that the transmission begins when the second function in "SEND" and "RECV" launches and finishes when both functions exit, the actual transmission time is supposed to be the minimum of "SEND" and "RECV" calling time. Thus, we profile both the sender and receiver during a point-to-point communication event and use the minimum of the two profiling results as the elapsed time of this event.

**All-reduce Communication Event Profiling.** Unlike point-to-point communication events, the number of devices participating in all-reduce communication can be greater than 2. When the number of devices is large, it becomes difficult to directly profile the process with limited GPUs.

Prior work [2] has shown that for an all-reduce operator with N devices participating in, there will be two-phase transmission: gather and broadcast. In the gather phase, input tensor will be split into N pieces so that each device transfers one piece for N-1 times to get one partial sum. Then in the broadcast phase, each device broadcasts the partial sum N-1 times to all other devices. Thus, the total transmission amount per device is  $2(N-1) \times P/N$  (where P is the size of tensor), which can be deduced from N and is unrelated to device number N when N is large.

We adopt this conclusion into our all-reduce communication event profiling. When 8 or fewer devices are involved, we directly run and profile the all-reduce operator; when the number is over 8, we profile the all-reduce process for 8 GPUs, and calculate the time using the above formula that takes into account the actual number of GPUs being used. During our evaluation in Sec.5.2, we find this calculation's effect on predicting iteration time is limited (<2%).

#### 4.3 Hierarchical Modeling

DistSim would construct the complete training timeline of the training process using model, pipeline, and data parallelism modeling sequentially based on the hierarchical dependencies for different distributed training strategies.

Model Parallelism Modeling. In model parallelism modeling, the set of events and model parallelism size will be used to generate the mapping from layers to events as output. When modeling parallelism is 1, layers will be mapped to a single computation event. Otherwise, the layers will be mapped to a composite event with multiple devices, each containing a computation event and an all-reduce communication event.

**Pipeline Parallelism Modeling.** The layer-to-events mapping from the model parallelism modeling is used as input, together

#### **Algorithm 1:** Pipeline Training Modeling Algorithm

```
Input: Pipeline Scheduling : S(D, L, M), layer-to-events mapping :
           MAP(L \rightarrow E), model parallelism size : MP
   Output: The event-list of pipeline : F
  // In S, D/L/M represents the set of devices, layers and micro-batches
     respectively
   F \leftarrow LIST[D \times MP]; // Initialize event-list
  while S \neq \emptyset do
        // Find the first stage in the schedule that matches restrictions, and also
          return time-stamp to insert the event
        st(d, l, m), t \leftarrow first\_available(S);
        e \leftarrow MAP[l];
        // Get the point-to-point communication event for the transmission
          between stages
         e_{comm} \leftarrow get\_comm\_event(st);
        for device m_p \in [1...MP] do
             F[d \times MP + m_p] \leftarrow
10
               F[d \times MP + m_p] \cup (e, t, m) \cup (e_{comm}, t + e.elapsed, m)
11
        end
        S \leftarrow
13 end
14 return F
```

with the pipeline schedule generated from the pipeline algorithm (currently DistSim has implemented Dapple [8] and GPipe [15]) and micro-batch-size. The pipeline scheduling contains the mapping from each events towards the device index and the chronological order of each micro-batch's forward and backward.

The pipeline parallelism modeling follows Algorithm 1 to construct the event-lists of pipeline training. First, the event-flow generator finds the first stage in the schedule which is available (Line 5) (i.e., data is prepared and micro-batch index is matched). The corresponding event is selected and an additional point-to-point communication event is also added if the stage needs to transfer activation to the next stage (Line 6-8). Then these events will be added to all the devices participating in computing this stage because of the model parallelism (Line 9-11). The generator will continue to add events into the event-list until the pipeline schedule is traversed.

If no pipeline strategy is involved, the pipeline schedule will become a sequential flow with only one stage, and the pipeline parallelism size will be set to 1. After pipeline modeling, the generator gets the event-list of  $MP \times PP$  devices, where MP/PP represents model parallelism size and pipeline parallelism size, respectively.

**Data Parallelism Modeling.** The event-list will be expanded from  $MP \times PP$  devices into  $MP \times PP \times DP$  devices by duplicating all the events DP times, where DP is the data parallelism size. Additionally, an all-reduce communication event will be added at the end of each event-lists according to the gradient size to be reduced.

### 5 EVALUATION

To show the accuracy of simulation towards real performance, we evaluate the following metrics:

- The time consumption of one iteration in the training process (batch time). It reflects the accuracy of the critical path of training and can be used to evaluate the performance of certain strategies or estimate the whole training time given a certain number of training epochs. (Sec. 5.2)
- The timestamps of the beginning and the ending of computation events. It represents the consistency of each device's activity,



Figure 8: The evaluation of batch-time (iteration time) between DistSim and actual result on Bert-Large (a), GPT-2-345m (b) and Text-to-Text Transfer Transfermer(T5) (c). The x-axis is the strategy of hybrid parallelism distributed training.



Figure 9: The evaluation of detailed GPUs' activities between DistSim and actual result on Bert-Large (a), GPT-2-345m (b) and T5(c). The x-axis is the strategy of hybrid parallelism distributed training. Each bar in a certain strategy represents one GPU's activity error.

which helps users to apply more computations when one GPU is idle to achieve more utilization. (Sec. 5.3)

- A detailed comparison of per-stage timestamps. In this metric, we show the differences for *every pipeline stage* and *every micro-batch* between DistSim and actual running, which shows similarities between the modeled timeline and actual timeline. It helps programmers to locate pipeline bubbles and performs practical operations such as fault-tolerance [18, 22, 26] during bubbles.(Sec. 5.4)
- We also evaluate DistSim to model the training for a large-scale network with 128 GPUs, compared with Megatron-LM[24]. (Sec. 5.5)

# 5.1 Experiment Setup

**Testbed.** We evaluate the accuracy of DistSim in a cluster with up to 16 Nvidia-A40 GPUs on 4 servers. We use the whole 16 GPUs for real distributed training performance tracing, and 4 GPUs with 2 servers for DistSim profiling and simulation. We use PyTorch-Distributed with CUDA 11.6 and NCCL backend as our evaluation framework. We leverage the ability of model partition and generation in Megatron-LM [33] as our model partitioner in DistSim.

**Benchmarks.** We train three models, BERT-Large [6], GPT-2-345M [27] and Text-to-Text Transfer Transformer(T5) [28], on 1, 2 and 4 servers with 4, 8, 16 GPUs as our evaluation workload. We select various distributed training strategy configurations to evaluation, denoted as "xM xP xD", which means the size of each dimension of parallelism (model, pipeline, and data parallelism accordingly).

#### 5.2 Overall Accuracy Evaluation

We evaluate DistSim's prediction accuracy of iteration time, called overall batch-time evaluation. We first generate event-flow with DistSim and calculate the analyzed iteration time. Then we profile on a real cluster to get the accurate iteration time. We compare the simulated and actual batch-time and analyze the error.

We apply the evaluation to various hybrid parallelism strategies and three different models described in the benchmarks. The evaluation result shows in Fig. 8. We find that DistSim evaluates the iteration time in high accuracy, with at most 3.51% errors, much lower than analytical approach presented in Fig. 3.

During the evaluation, we notice that the error has weak relationship with the number of GPUs and the parallelism strategies. For example, in Bert-Large training, the error in less GPUs ("1M2P2D", 4 GPUs, 2.45%) is larger than more GPUs ("2M2P4D", 16 GPUs, 1.63%). We address this phenomenon as the random fluctuation during profiling since we only profile 100 iterations to get the average batch-time.

#### 5.3 Per-GPU Detailed Accuracy Evaluation

We evaluate DistSim's simulation result of every GPU. In the event-flow generated by DistSim, we select all the events' beginning and ending timestamps and calculate the average bias from the actual timeline. The results in Fig. 9 show high accuracies of the GPU-level modeling, with at most 4.19% errors.

We find that some GPUs' errors are higher than the average level (e.g., GPU 9-12 in GPT-2 "2M2P4D" training). There are mainly three reasons. First, we find during our evaluation that the profiling fluctuation affects individual devices' activities more than batch-time since the fluctuation will accumulate and affect other GPUs. Second, we use rank 0's clock time as a global standard for simplification. Therefore, it introduces the time alignment problem addressed in dPRO [14], and the actual timestamp should be adjusted. Third, there is no device synchronization in data parallelism until weight all-reduce, so there exist timeline variations among different data parallelism ranks.



Figure 10: The detailed per-stage comparison result for Bert[6] model with model parallelism size 2 and pipeline parallelism 4. The micro-batch size is set to 4. The x-axis is the training phase (forward, backward) of a certain micro-batch and the y-axis is the median error for the DistSim simulation and 100-times actual running. Each bar in one stage represents the result of GPU0-7, respectively.

Additionally, our evaluation has shown that when more GPUs are involved, the margin of error seems to increase. After further analysis, we find that the error positively correlates with the pipeline parallelism size. This is because errors in earlier pipeline stages can impact later stages, resulting in a greater likelihood of deviations from the analyzed model's timeline in larger pipeline parallelism scenarios.

# 5.4 Per-stage Accuracy Evaluation

We also evaluate after DistSim's modeling, how similar for each forward and backward computation is between DistSim and the golden result. We use the parallel configuration "2m4p1d" and micro-batch size 4, therefore it will include 32 forward and backward stages in total, with 4 per GPU. The difference between the start and finish timestamps according to the whole training is regarded as the error. We run the real training iteration 100 times, and the error distribution is presented in Fig. 10.

In the result, we find that the largest medium error among every stage and every GPU is 1.71%, which shows the low bias from the simulation and actual running. Additionally, the error distribution for every two GPUs (0 and 1, 2 and 3, etc.) is generally the same, which shows the computation of different model parallelism parts can be regarded as the same. We also find interestingly that only the error in the first stage of the first 2-4 GPUs is approaching 0 and others are mainly equal. This is because we use the start timestamp of the first stage as the global standard time for both simulated and actual timelines. Therefore, the error for this timestamp is 0.

#### 5.5 Large-scale Generalization Evaluation

To evaluate how DistSim performs on a real large-scale model, we use DistSim to model the training process of a 145-billion-parameter GPT model with 128 GPUs. The distributed configuration is "8M16P1D", which is given by Megatron-LM [24].

We compare the modeling result with data reported by Megatron-LM [24]. Because the profiling hardware and the bandwidth is different with Megatron-LM, we only compare the normalized throughput according to batch-size 1 to evaluate whether the modeling is in accordance with fact. The result is in Fig. 11. We find that the throughput increment rate has high similarities between DistSim and Megatron-LM, which shows the ability of DistSim to model large-scale training.



Figure 11: A thoughput comparison between modeling by DistSim and actual running reported by Megatron-LM on 128 GPUs, training for a GPT model with 145 billion parameters using model parallelism size 8 and pipeline parallelism 16. The reported data can be found in Fig.17 in the experiments of Megatron-LM [24].

# 6 USE-CASE: AUTO PARALLEL STRATEGY SEARCH

In this section, we provide a simple yet important use-case of Dist-Sim. Since DistSim can accurately evaluate the training process's throughput with different hybrid parallelism strategies, it can be used to find an optimal strategy before using the actual cluster.

The strategy search is applied on new unseen model called "BERT-exLarge" with 48 transformer layers on 4 new nodes with 16 A10 GPUs. Suppose the global batch-size is fixed into 16, then the throughput evaluation can be regarded as iteration time.

We use a grid-search method to traverse the hybrid parallelism search space with the help of DistSim. There are 5 configuration choices for each of the parallelism dimension – 1, 2, 4, 8 and 16. Overall, there are 15 different hybrid parallelism settings as some combinations are invalid. We only need to traverse along the pipelineand the model- parallelism axis as data parallelism size can be conducted from the above two (DP = GPUs/MP/PP).

We evaluate all the settings with DistSim, and show the results in Fig. 12. The result indicates that the strategy whose data parallelism size 2 and pipeline parallelism size 8 is optimal with throughput over 2.94 iterations per second, which at most increases  $7.37 \times$  from the worst strategy (model parallelism size 16).

Next, we run on an actual 16 GPUs cluster to verify whether the searched strategy is reliable. Table. 2 shows high similarities between DistSim's grid search and the accurate profiling result.



Figure 12: BERT-exLarge training strategy grid-search results using DistSim on 4 nodes with 16 GPUs. The red bar (data and pipeline parallelism size of 2 and 8) achieves the best throughput. Note that the unreachable configurations are drawn to 0.

In addition, we also analyze the cost of time in profiling and simulation compared to direct running, which shows in Table. 3. We find that DistSim only takes about 12.96% of time compared to direct running due to the reduction of profiling redundancy. The main cost in profiling is some all-reduce operators since their redundancy is limited. For example, in "2m1p8d" strategy, there are only two same all-reduce operators with 8 GPUs involved, and we can only reduce 50% profiling redundancy. More redundancy could be discovered and reduced on a larger scale. Also, from the result, we find that simulation time only occupies a tiny portion of the total time cost (<1%), so if the profiling time can be reduced, the total cost could be further eliminated.

#### 7 RELATED WORK AND DISCUSSION

Distributed Training Performance Modeling. DistIR[31] defines a distributed intermediate representation (IR) to represent the computation and communication flow of distributed training, aiming to reveal the behavior of each rank (device) and deploy each rank's work more conveniently. However, DistIR is simply an analytical model, which shows unsatisfying accuracy (over 30% on average) between actual training and simulated training, limiting its usage for optimizing the performance and throughput. Daydream [45] and dPRO [14] both focus on using a profiler and simulator to evaluate the performance and bottleneck in distributed data-parallel training. Daydream [45] proposes a kernel-level profiler based on CUPTI [25] to get a kernel-level time and maps kernel to model operators to get the operator computation and communication cost. Then they use a simulator to replay the training process and identify how much throughput improves when using a certain optimization method. Similarly, dPRO [14] leverages the frameworks' profiling ability (TensorFlow or PyTorch) to get the operator-level time. They also propose a global data-flow graph (DFG) to simulate and replay the training process when certain

Table 2: Comparison between DistSim-based distributed training strategy grid-search and actual measurement.

|         | best strategy<br>(iter/s) | second-best<br>strategy<br>(iter/s) | worst strategy<br>(iter/s) | speed up |
|---------|---------------------------|-------------------------------------|----------------------------|----------|
| DistSim | 2.94                      | 2.92                                | 0.398                      | 7.379×   |
| Actual  | 2.97                      | 2.90                                | 0.396                      | 7.488×   |

Table 3: The comparsion of time cost between DistSim and actual run while searching for the best strategy.

|            | Simulate Time (s) | Profiling GPU<br>Time (gpu×s) | Relative Scale |
|------------|-------------------|-------------------------------|----------------|
| DistSim    | 0.14              | 49.18                         | 0.1296×        |
| Direct Run | -                 | 380.35                        | 1×             |

optimization of distributed training is used. However, those approaches only considered data parallelism in distributed training. Their methods fail to analyze the situation when a model is too large to deploy on a single device and other distributed training strategies (e.g., pipeline or model parallelism) must be imported.

Modeling in Auto Parallel Strategy Search. To find the optimal parallel strategy, many works, such as Piper [36], AccPar [34], PaSE [7] and Alpa [40], construct their cost models toward the training throughput and propose algorithms heuristically to search the optimal. Piper, PaSE and Alpa define cost models according to the latency of training iteration. Then they use dynamic programming approaches to search for the least latency according to their different search spaces. AccPar [34] uses a layer-wise recursive algorithm to find the optimal partition of each layer in the data and model parallelism domain. Although these approaches can find a state-of-the-art distributed training strategy, they are unable to evaluate how much throughput improves and why their approaches outperform from devices' activity aspect unless evaluated on real clusters. Moreover, since most of the works use analytical assumptions [7, 34, 36], such as "computing time is equal to the division of operator count and computing capacity", the result of the cost models only shows the advantage of one strategy towards other strategies, but cannot report their detailed statistics.

#### Discussion: Scalability towards New Strategies, Algorithms,

Communication Operators, Models, etc. It is important to analyze whether DistSim can handle unseen distributed training configurations. For new distributed strategies such as ZeRO-DP [29] and 3D-parallelism [3], DistSim will take these strategies as input. Their dependencies can be recognized. For example, ZeRO-DP has a combination of intra-layer and inter-model dependencies. Therefore, the model can be partitioned from these strategies, DistSim can generate events and perform modeling. For new algorithms such as asynchronous pipeline parallelism like Pipedream [23], the schedule in pipeline parallelism modeling can still be established only without a global synchronize event (usually an all-reduce event), so pipeline modeling can still complete. For new communication or computation operators in unseen models, DistSim can regard them as new events and perform profiling normally.

#### 8 CONCLUSION

It is important to evaluate and analyze the performance of throughput and utilization in distributed DNN training. DistSim is a new performance model that can achieve high accuracy with low profiling cost in combining data, model and pipeline parallelism. DistSim achieves this by two key contributions: (1) the same computation and communication processed by different devices, resulting in profiling redundancy; (2) the hierarchy of different parallelism strategies with different partition granularity, which makes it possible to model the training process step by step. Our evaluation shows that DistSim achieves high accuracy in modeling the training process with different strategy configurations and can be used in various tasks such as training strategy autotuning.

#### ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China under Grant 2022YFB4501401, the National Natural Science Foundation of China (NSFC) grant (62222210, and 62072297, and 61832006). The authors would like to thank the anonymous reviewers for their constructive feedback for improving the work. Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

#### **REFERENCES**

- [1] Amazon. [n. d.]. AWS Pricing Calculator. https://calculator.aws/.
- [2] Baidu. [n.d.]. Ring AllReduce. https://github.com/baidu-research/baiduallreduce.
- [3] Zhengda Bian, Qifan Xu, Boxiang Wang, and Yang You. 2021. Maximizing Parallelism in Distributed Training for Huge Neural Networks. CoRR abs/2105.14450 (2021). arXiv:2105.14450 https://arxiv.org/abs/2105.14450
- [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901.
- [5] Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, and Chuanxiong Guo. 2020. Elastic Parameter Server Load Distribution in Deep Learning Clusters. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC '20). Association for Computing Machinery, New York, NY, USA, 507–521. https://doi.org/10.1145/341911.3421307
- [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805
- Venmugil Elango. 2021. Pase: Parallelization Strategies for Efficient DNN Training. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 1025–1034. https://doi.org/10.1109/IPDPS49936.2021.00111
- [8] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, and Wei Lin. 2021. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP '21). Association for Computing Machinery, New York, NY, USA, 431–445. https://doi.org/10.1145/3437801.3441593
- [9] Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse dnn models without hardware-support via tile-wise sparsity. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.
- [10] Cong Guo, Yuxian Qiu, Jingwen Leng, Xiaotian Gao, Chen Zhang, Yunxin Liu, Fan Yang, Yuhao Zhu, and Minyi Guo. 2022. SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation. In *International Conference* on Learning Representations. https://openreview.net/forum?id=JXhROKNZzOc

- [11] Cong Guo, Yuxian Qiu, Jingwen Leng, Chen Zhang, Ying Cao, Quanlu Zhang, Yunxin Liu, Fan Yang, and Minyi Guo. 2022. Nesting Forward Automatic Differentiation for Memory-Efficient Deep Neural Network Training. In 2022 IEEE 40th International Conference on Computer Design (ICCD). IEEE, 738–745.
- [12] Cong Guo, Chen Zhang, Jingwen Leng, Zihan Liu, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu. 2022. Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1414–1433.
- [13] Cong Guo, Yangjie Zhou, Jingwen Leng, Yuhao Zhu, Zidong Du, Quan Chen, Chao Li, Bin Yao, and Minyi Guo. 2020. Balancing efficiency and flexibility for DNN acceleration via temporal GPU-systolic array integration. In 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
- [14] Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, and Chuanxiong Guo. 2022. dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training. In Proceedings of Machine Learning and Systems, D. Marculescu, Y. Chi, and C. Wu (Eds.), Vol. 4. 623–637.
- [15] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc.
- [16] Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997 (2014). arXiv:1404.5997
- [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc.
- [18] Jingwen Leng, Alper Buyuktosunoglu, Ramon Bertran, Pradip Bose, Quan Chen, Minyi Guo, and Vijay Janapa Reddi. 2020. Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 44–57. https://doi.org/10.1109/HPCA47549.2020.00014
- [19] Shigang Li and Torsten Hoefler. 2021. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 27, 14 pages. https://doi.org/10.1145/3458817.3476145
- [20] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. CoRR abs/2006.15704 (2020). arXiv:2006.15704
- [21] Zihan Liu, Jingwen Leng, Zhihui Zhang, Quan Chen, Chao Li, and Minyi Guo. 2022. VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Services via Adaptive Compilation and Scheduling. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS '22). Association for Computing Machinery, New York, NY, USA, 388–401. https://doi.org/10.1145/3503222.3507752
- [22] Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. CheckFreq: Frequent, Fine-Grained DNN Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 203–216. https://www.usenix.org/conference/fast21/presentation/mohan
- [23] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP '19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
- [24] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC '21). Association for Computing Machinery, New York, NY, USA, Article 58, 15 pages. https://doi.org/10.1145/3458817.3476209
- [25] NVIDIA. [n. d.]. CUPTI. https://docs.nvidia.com/cuda/cupti/.
- Yuxian Qiu, Jingwen Leng, Cong Guo, Quan Chen, Chao Li, Minyi Guo, and Yuhao Zhu. 2019. Adversarial defense through network profiling based path extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4777–4786.
- [27] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language Models are Unsupervised Multitask Learners. (2018).
- [28] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR abs/1910.10683 (2019). arXiv:1910.10683

- [29] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–16. https://doi.org/10.1109/SC41405.2020.00024
- [30] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deep-Speed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD '20). Association for Computing Machinery, New York, NY, USA, 3505–3506. https://doi.org/10.1145/3394486.3406703
- [31] Keshav Santhanam, Siddharth Krishna, Ryota Tomioka, Andrew Fitzgibbon, and Tim Harris. 2021. DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks (EuroMLSys '21). Association for Computing Machinery, New York, NY, USA, 15–23. https://doi.org/10.1145/3437984.3458829
- [32] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799
- [33] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. CoRR abs/1909.08053 (2019). arXiv:1909.08053
- [34] Linghao Song, Fan Chen, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2020. AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators. In 2020 IEEE International Symposium on High Performance Computer Architecture (IFPCA). 342–355. https://doi.org/10.1109/HPCA47549.2020.00036
- [35] Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, Zhongliang Chen, Rafael Ubal, José L. Abellán, John Kim, Ajay Joshi, and David Kaeli. 2019. MGPUSim: Enabling Multi-GPU Performance Modeling and Optimization. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). 197–209.
- [36] Jakub M Tarnawski, Deepak Narayanan, and Amar Phanishayee. 2021. Piper: Multidimensional Planner for DNN Parallelization. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and I. Wortman Vaughan (Eds.), Vol. 34. Curran Associates, Inc., 24829–24840.
- [37] Yang Wang, Chen Zhang, Zhiqiang Xie, Cong Guo, Yunxin Liu, and Jingwen Leng. 2021. Dual-side sparse tensor core. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 1083–1095.

- [38] Zhenning Wang, Jun Yang, Rami Melhem, Bruce Childers, Youtao Zhang, and Minyi Guo. 2017. Quality of Service Support for Fine-Grained Sharing on GPUs. SIGARCH Comput. Archit. News 45, 2 (jun 2017), 269–281. https://doi.org/10. 1145/3140659.3080203
- [39] Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 503–521. https://www.usenix.org/conference/atc21/presentation/yu
- [40] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E. Gonzalez, and Ion Stoica. 2022. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. CoRR abs/2201.12023 (2022). arXiv:2201.12023
- [41] Xiaojie Zhou, Kun Wang, Weijia Jia, and Minyi Guo. 2017. Reinforcement learning-based adaptive resource management of differentiated services in geo-distributed data centers. In 2017 IEEE/ACM 25th International Symposium on Quality of Service (IWQoS). 1–6. https://doi.org/10.1109/IWQoS.2017.7969161
- [42] Yangjie Zhou, Jingwen Leng, Yaoxu Song, Shuwen Lu, Mian Wang, Chao Li, Minyi Guo, Wenting Shen, Yong Li, Wei Lin, et al. 2023. uGrapher: High-Performance Graph Operator Computation via Unified Abstraction for Graph Neural Networks. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 878–891.
- [43] Yangjie Zhou, Mengtian Yang, Cong Guo, Jingwen Leng, Yun Liang, Quan Chen, Minyi Guo, and Yuhao Zhu. 2021. Characterizing and demystifying the implicit convolution algorithm on commercial matrix-multiplication accelerators. In 2021 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 214– 225.
- [44] Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload Characterization (IISWC). 88–100. https://doi.org/10.1109/IISWC.2018.8573476
- [45] Hongyu Zhu, Amar Phanishayee, and Gennady Pekhimenko. 2020. Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 337–352.